The idea is to create a ‘custom’ distance matrix that induces the groupings.
\(c_1\) applies within the same species (unassigned ASVs excluded)
\(c_2\) applies between different assigned species
\(c_3\) applies between an assigned species and an unassigned ASV
\(c_4\) applies between two unassigned ASVs
For this document we will use only the following configuration:
\(c_1=1\), \(c_2=1000\), \(c_3=10\), \(c_4=10\)
A priori, since \(c_3=c_4\), the ‘probability’ that an unassigned ASV agglomerates with another unassigned ASV is the same as the probability that it agglomerates with an ASV of a known species. A large \(c_2\) makes it harder for ASVs assigned to different known species to be grouped together.
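As a minimal sketch of the configuration above, the following Python snippet builds the matrix of constants \(c_{ij}\) from a vector of species labels. The function name `cost_matrix`, the use of `None` for unassigned ASVs, and the zero diagonal are assumptions made for illustration. How the constants are combined with the underlying distances (for example, by element-wise multiplication) is not specified here and is left open.

```python
import numpy as np

def cost_matrix(species, c1=1.0, c2=1000.0, c3=10.0, c4=10.0):
    """Pairwise constants c_ij for a list of species labels.

    `species` holds one label per ASV; None marks an unassigned ASV
    (an illustrative convention, not taken from the text).
    """
    n = len(species)
    C = np.zeros((n, n))  # zero diagonal is an assumption
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            a, b = species[i], species[j]
            if a is None and b is None:
                C[i, j] = c4      # unassigned vs unassigned
            elif a is None or b is None:
                C[i, j] = c3      # assigned species vs unassigned
            elif a == b:
                C[i, j] = c1      # same assigned species
            else:
                C[i, j] = c2      # two different assigned species
    return C

# Hypothetical labels: two ASVs of species "A", one of "B", two unassigned.
labels = ["A", "A", "B", None, None]
C = cost_matrix(labels)
```

With the configuration used in this document, only pairs from two different assigned species receive the large constant, so those pairs are the last to merge in any agglomerative procedure driven by this matrix.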
\(\delta^2_{ij}\) is an indicator function (1 if observations \(i\) and \(j\) are from the same cluster, 0 otherwise). \(SS_A\) is then the difference between the total and the within-group sums of squares (\(SS_{A}=SS_{T}-SS_{W}\)).
\[ F = \frac{\frac{SS_{A}}{p-1}}{\frac{SS_{W}}{N-p}} \]
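The statistic can be sketched in Python as follows. This assumes the usual distance-based definitions, which may differ in detail from the ones used in this document: \(SS_T\) is the sum of squared pairwise distances divided by \(N\), and \(SS_W\) is the sum over clusters of the squared within-cluster pairwise distances divided by the cluster size. Here \(N\) is taken to be the number of observations and \(p\) the number of clusters. The function name `pseudo_f` is hypothetical.

```python
import numpy as np

def pseudo_f(D, clusters):
    """Pseudo-F from a symmetric distance matrix D and cluster labels."""
    D = np.asarray(D, dtype=float)
    clusters = np.asarray(clusters)
    N = D.shape[0]
    groups = np.unique(clusters)
    p = len(groups)

    # Total sum of squares: each pair (i < j) counted once, divided by N.
    iu = np.triu_indices(N, k=1)
    ss_t = (D[iu] ** 2).sum() / N

    # Within sum of squares: same quantity per cluster, divided by its size.
    ss_w = 0.0
    for g in groups:
        idx = np.where(clusters == g)[0]
        sub = D[np.ix_(idx, idx)]
        ss_w += (np.triu(sub, k=1) ** 2).sum() / len(idx)

    ss_a = ss_t - ss_w
    return (ss_a / (p - 1)) / (ss_w / (N - p))

# Toy example: four points on a line, split into two tight clusters.
x = np.array([0.0, 1.0, 10.0, 11.0])
D = np.abs(x[:, None] - x[None, :])
F = pseudo_f(D, [0, 0, 1, 1])
```

The function requires \(1 < p < N\) and at least one cluster with a non-zero internal distance, otherwise the ratio is undefined.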
The idea of the new metric is to represent the heterogeneity of a cluster by its maximum distance, that is, the largest distance between two of its members. The average maximum within-cluster distance therefore tends to decrease as the number of clusters increases.
Conversely, think of the minimum distance between two clusters as a measure of how different they are. The average minimum distance between clusters also tends to decrease as the number of clusters increases.
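The two quantities described above can be sketched as follows. This is only one plausible reading: the within-cluster maximum is averaged over clusters (singletons contribute 0), and the between-cluster minimum is averaged over all pairs of clusters. The function names `avg_max_within` and `avg_min_between` are hypothetical, and how the two averages are combined into a single metric is not covered here.

```python
import numpy as np
from itertools import combinations

def avg_max_within(D, clusters):
    """Mean over clusters of the largest distance between two members."""
    D = np.asarray(D, dtype=float)
    clusters = np.asarray(clusters)
    maxima = []
    for g in np.unique(clusters):
        idx = np.where(clusters == g)[0]
        # A singleton only sees its own zero diagonal entry, so it gives 0.
        maxima.append(D[np.ix_(idx, idx)].max())
    return float(np.mean(maxima))

def avg_min_between(D, clusters):
    """Mean over cluster pairs of the smallest cross-cluster distance.

    Needs at least two clusters.
    """
    D = np.asarray(D, dtype=float)
    clusters = np.asarray(clusters)
    minima = []
    for g, h in combinations(np.unique(clusters), 2):
        gi = np.where(clusters == g)[0]
        hi = np.where(clusters == h)[0]
        minima.append(D[np.ix_(gi, hi)].min())
    return float(np.mean(minima))

# Toy example: four points on a line, split into two tight clusters.
x = np.array([0.0, 1.0, 10.0, 11.0])
D = np.abs(x[:, None] - x[None, :])
within = avg_max_within(D, [0, 0, 1, 1])
between = avg_min_between(D, [0, 0, 1, 1])
```

In the toy example the clusters are compact and well separated, so the within value is small relative to the between value. Evaluating both over a range of cluster counts shows how each one decreases as the partition gets finer.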